Introduction to Open Data Science - Course Project

About the project

Initial feelings, expectations

I am looking forward to the course for two reasons:

  • I would like to get a structured introduction to how R is used for data science, as I previously learnt this on my own and with much help from a friend of mine
  • I hope to learn some new features too

Another initial impression is that the Moodle page is rather verbose and chaotic, which hampers finding the necessary information and the assigned tasks. I hope that I will get used to this and will not miss any compulsory assignment.

I found the course in Sisu because I still needed one credit for my Transferable Skills section.

Learning experience with R for Health Data Science

I found the book interesting and well written. Naturally, due to my previous experience with R, I was only skim-reading it. Nevertheless, I found some useful functionalities that I had not used before, e.g.
  1. the usage of logical operators in the filter() function, e.g. filter(year == 2020 | disease == "COVID19")
  2. I tend to forget how to dodge the columns of a bar chart, but I was reminded and produced some nice plots:
gapdata2007 %>% 
  ggplot(aes(x = continent, y = lifeExp, fill = country)) +
  geom_col(position="dodge") + theme(legend.position = "none") 

  3. I was not aware of the percent() function (from the scales package); it is much simpler than sprintf("%2.1f%%", x).
  4. It was nice to learn that with plotly and the HTML output of R Markdown one can create interactive documents (one can hover over the diagram below to highlight countries).
plot = gapdata2007 %>% 
  ggplot(aes(x = gdpPercap, y = lifeExp, color = continent, label = country)) +
  geom_point()
  
ggplotly(plot)

Regression analysis

Task description

Describe the work you have done this week and summarize your learning.

  • Describe your work and results clearly.
  • Assume the reader has an introductory course level understanding of writing and reading R code as well as statistical methods.
  • Assume the reader has no previous knowledge of your data or the more advanced methods you are using.

Analysis

Data characteristics

The data consist of questionnaire responses from individuals together with their results on some sort of test (exam). The questionnaire seems to refer to study habits. The origin of the data cannot be clearly identified from the metadata provided.

  df = as_tibble(read.csv('http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt', sep = '\t'))
  
  learning2014 = read_csv('data/learning2014.csv') %>% 
    rename(points = Points, attitude = Attitude)

The original raw data consist of 183 records and 60 variables. After summarizing related questions into groups (such as questions referring to in-depth studying into a "deep" group) by taking their mean, and removing observations with 0 points, the processed data consist of 166 observations and 7 variables. Information on the data structure can be found below.
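The grouping step can be sketched as below on toy data (the column names and values here are illustrative assumptions, not the real question codes from the dataset):

```r
library(dplyr)

# Toy data standing in for the raw questionnaire
# (column names and values are illustrative assumptions)
raw <- tibble(
  D1 = c(4, 5, 3),          # questions belonging to the "deep" group
  D2 = c(2, 5, 3),
  Points = c(20, 0, 25)
)

processed <- raw %>%
  mutate(deep = rowMeans(across(starts_with("D")))) %>%  # average the question group
  filter(Points > 0)                                     # drop zero-point records
```

The same mutate/filter pattern extends to the "stra" and "surf" groups by listing their respective question columns.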

  str(learning2014)
## spec_tbl_df [166 x 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gender  : chr [1:166] "F" "M" "F" "M" ...
##  $ Age     : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   gender = col_character(),
##   ..   Age = col_double(),
##   ..   Attitude = col_double(),
##   ..   deep = col_double(),
##   ..   stra = col_double(),
##   ..   surf = col_double(),
##   ..   Points = col_double()
##   .. )

Graphical overview

Let’s have a look at the distribution of the variables in this data by plotting them pairwise in a scatter-plot matrix. The figure below provides detailed insight into the data. We suspect that there are differences between the male and female participants, hence we stratify the data into these two groups by color.

  colMap = c("F" = "#FF5555", "M" = "#1100EE")
  # create a more advanced plot matrix with ggpairs()
  ggpairs(learning2014, mapping = aes(color = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)),
           upper = list(continuous = wrap("cor")) ) +
    scale_color_manual(values = colMap) +
    scale_fill_manual(values = colMap)

We can read several pieces of information from the plot. First, there were almost 1.5 times more females (red) in the study than males (blue), as can be seen from the bar chart of the gender variable. Based on the boxplots in the top row, the males' attitude towards statistics was higher; however, there was no huge difference in points between males and females. Nevertheless, the positive correlation between attitude and points is observable for both males and females.

Regression model

In a linear regression analysis we search for a linear relationship between predictor (feature) variables and a target outcome. Let’s pick the 3 variables that are the most likely candidates based on the previous plot:

my_model <- lm(points ~ attitude + stra + surf, data = learning2014)

# print out a summary of the model
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The results show that the attitude is significantly useful when predicting the exam points as the p-value for the slope of this variable is very low (\(1.93*10^{-8}\)). The F-statistic of the test also indicates that at least one of the selected variables has non-zero slope.

The t-test for each variable has the null-hypothesis that the slope of that variable in the regression equation is 0. If that is true then there is no association between the variable and the target. If the p-value is low then the result is unlikely under the null-hypothesis so usually we reject it i.e. we believe that there is association.
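As a quick sanity check of the summary table, the t value is simply the estimate divided by its standard error; for the attitude row above:

```r
# t statistic = estimate / standard error (numbers from the summary above)
t_attitude <- 3.3952 / 0.5741
round(t_attitude, 2)   # about 5.91, matching the reported 5.913
```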

So we fit the model again with attitude only (as that was the only variable showing significance).

Interpret regression parameters

my_model <- lm(points ~ attitude, data = learning2014)

# print out a summary of the model
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The summary of this model shows that there is a positive association (with a slope of about 3.5) between the attitude score and the exam points. In lay terms this means that a 1-unit increase in attitude corresponds to roughly a 3.5-point increase in the exam result. In addition, there is a baseline of 11.6 points for those with the lowest attitudes. The multiple R-squared value gives the proportion of variation that is explained by the variables in the model. Here it is 0.1906, so a little less than 20% of the variation is explained by the explanatory variable, meaning that there are substantial other factors affecting the exam results that we are omitting here.
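The fitted line can also be applied by hand; with the rounded coefficients from the summary above, a student with an attitude score of 3.0 would be predicted as:

```r
# points = intercept + slope * attitude (coefficients from the summary above)
intercept <- 11.6372
slope     <- 3.5255
intercept + slope * 3.0   # roughly 22.2 predicted exam points
```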

Graphical model interpretation

par(mfrow = c(2,2))
plot(my_model, which = c(1,2,5))

The linear regression model assumes that the observations are produced by a linear function of the explanatory variable plus additive noise following a normal distribution with zero mean. The residuals estimate this noise, hence it is good to check how normal they are, and whether their values are independent of the fitted values (otherwise the noise would depend on the observation). The first two plots refer to this. On the first one we can see that the residuals are evenly distributed over the fitted values. The Q-Q plot shows that the quantiles of the residual distribution are very close to those of a normal distribution. The last "Residuals vs. Leverage" plot indicates which observations may have the greatest effect on the parameters (it is also known that linear regression is sensitive to outliers), hence it could be useful to check if the model improves a lot by removing those observations.

Prediction

Beyond explaining the data, linear regression models can also be used to predict unobserved results, such as below.

new_attitudes <- c("Mia" = 3.8, "Mike"= 4.4, "Riikka" = 2.2, "Pekka" = 2.9)
new_data <- data.frame(attitude = new_attitudes)

# Print out the new data
new_data
##        attitude
## Mia         3.8
## Mike        4.4
## Riikka      2.2
## Pekka       2.9
# Predict the new students exam points based on attitude
predict(my_model, newdata = new_data)
##      Mia     Mike   Riikka    Pekka 
## 25.03390 27.14918 19.39317 21.86099

Logistic regression

1. Create the file

As one can see, it has been created.

2. Data characteristics

We got data on students' school performance, including grades, the number of previous class failures, and related background information such as family status, alcohol consumption, etc. We had information from two subjects: Math and Portuguese. To create a joint dataset we round-averaged the grades from these two subjects and included only students present in both (at least based on their attributes; the table does not contain unique student identifiers, which may cause confusion).
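A rough sketch of that join logic on toy tables (the table and column names here are placeholders, not the real dataset):

```r
library(dplyr)

# Toy tables standing in for the Math and Portuguese class data
df_math <- tibble(sex = c("F", "M"), age = c(15, 16), G3 = c(12, 8))
df_por  <- tibble(sex = c("F", "M"), age = c(15, 16), G3 = c(14, 10))

joined <- inner_join(df_math, df_por,
                     by = c("sex", "age"),        # background columns identify a student
                     suffix = c("_math", "_por")) %>%
  mutate(G3 = round((G3_math + G3_por) / 2))      # round-average the grades
```

The inner join keeps only students appearing in both tables, which mirrors the "present in both classes" rule described above.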

# Double check the data wrangling result
  df_dc = as_tibble(read.csv('https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv', sep = ','))
  
  df = read_csv('data/student_joint.csv')
  #full_join(df,df_dc)
  # OK, this is of the same size, i.e. the tables agree.

After these operations there are 370 students with 35 variables in the data.

  str(df)
## spec_tbl_df [370 x 35] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ school    : chr [1:370] "GP" "GP" "GP" "GP" ...
##  $ sex       : chr [1:370] "F" "F" "F" "F" ...
##  $ age       : num [1:370] 15 15 15 15 15 15 15 15 15 15 ...
##  $ address   : chr [1:370] "R" "R" "R" "R" ...
##  $ famsize   : chr [1:370] "GT3" "GT3" "GT3" "GT3" ...
##  $ Pstatus   : chr [1:370] "T" "T" "T" "T" ...
##  $ Medu      : num [1:370] 1 1 2 2 3 3 3 2 3 3 ...
##  $ Fedu      : num [1:370] 1 1 2 4 3 4 4 2 1 3 ...
##  $ Mjob      : chr [1:370] "at_home" "other" "at_home" "services" ...
##  $ Fjob      : chr [1:370] "other" "other" "other" "health" ...
##  $ reason    : chr [1:370] "home" "reputation" "reputation" "course" ...
##  $ guardian  : chr [1:370] "mother" "mother" "mother" "mother" ...
##  $ traveltime: num [1:370] 2 1 1 1 2 1 2 2 2 1 ...
##  $ studytime : num [1:370] 4 2 1 3 3 3 3 2 4 4 ...
##  $ schoolsup : chr [1:370] "yes" "yes" "yes" "yes" ...
##  $ famsup    : chr [1:370] "yes" "yes" "yes" "yes" ...
##  $ activities: chr [1:370] "yes" "no" "yes" "yes" ...
##  $ nursery   : chr [1:370] "yes" "no" "yes" "yes" ...
##  $ higher    : chr [1:370] "yes" "yes" "yes" "yes" ...
##  $ internet  : chr [1:370] "yes" "yes" "no" "yes" ...
##  $ romantic  : chr [1:370] "no" "yes" "no" "no" ...
##  $ famrel    : num [1:370] 3 3 4 4 4 4 4 4 4 4 ...
##  $ freetime  : num [1:370] 1 3 3 3 2 3 2 1 4 3 ...
##  $ goout     : num [1:370] 2 4 1 2 1 2 2 3 2 3 ...
##  $ Dalc      : num [1:370] 1 2 1 1 2 1 2 1 2 1 ...
##  $ Walc      : num [1:370] 1 4 1 1 3 1 2 3 3 1 ...
##  $ health    : num [1:370] 1 5 2 5 3 5 5 4 3 4 ...
##  $ failures  : num [1:370] 0 1 0 0 1 0 1 0 0 0 ...
##  $ absences  : num [1:370] 3 2 8 2 5 2 0 1 9 10 ...
##  $ G1        : num [1:370] 10 10 14 10 12 12 11 10 16 10 ...
##  $ G2        : num [1:370] 12 8 13 10 12 12 6 10 16 10 ...
##  $ G3        : num [1:370] 12 8 12 9 12 12 6 10 16 10 ...
##  $ paid      : chr [1:370] "yes" "no" "yes" "yes" ...
##  $ alc_use   : num [1:370] 1 3 1 1 2.5 1 2 2 2.5 1 ...
##  $ high_use  : logi [1:370] FALSE TRUE FALSE FALSE TRUE FALSE ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   school = col_character(),
##   ..   sex = col_character(),
##   ..   age = col_double(),
##   ..   address = col_character(),
##   ..   famsize = col_character(),
##   ..   Pstatus = col_character(),
##   ..   Medu = col_double(),
##   ..   Fedu = col_double(),
##   ..   Mjob = col_character(),
##   ..   Fjob = col_character(),
##   ..   reason = col_character(),
##   ..   guardian = col_character(),
##   ..   traveltime = col_double(),
##   ..   studytime = col_double(),
##   ..   schoolsup = col_character(),
##   ..   famsup = col_character(),
##   ..   activities = col_character(),
##   ..   nursery = col_character(),
##   ..   higher = col_character(),
##   ..   internet = col_character(),
##   ..   romantic = col_character(),
##   ..   famrel = col_double(),
##   ..   freetime = col_double(),
##   ..   goout = col_double(),
##   ..   Dalc = col_double(),
##   ..   Walc = col_double(),
##   ..   health = col_double(),
##   ..   failures = col_double(),
##   ..   absences = col_double(),
##   ..   G1 = col_double(),
##   ..   G2 = col_double(),
##   ..   G3 = col_double(),
##   ..   paid = col_character(),
##   ..   alc_use = col_double(),
##   ..   high_use = col_logical()
##   .. )

3. Hypothesis

I picked 4 variables to examine whether they correlate with high alcohol consumption. Below I explain my a priori expectations.

  • studytime: likely, the less someone studies, the more time is available for consuming alcohol. It is unlikely that someone would use alcohol to mitigate stress from too much studying.
  • activities: the more activities someone does, the less time they have for hanging out with bad student groups; again, I expect a negative correlation.
  • higher (meaning higher-education plans): students with clearer future plans are less likely to be heavy alcohol consumers.
  • freetime: probably, students with too much free time get more bored, hence consume more alcohol to make their life exciting.

4. Verification

   p1 = df %>% ggplot() + aes(x = high_use, y = studytime, fill = high_use) + geom_violin()
   p2 = df %>% ggplot() + aes(x = activities, fill = high_use) + geom_bar(position = "fill")
   p3 = df %>% ggplot() + aes(x = higher, fill = high_use) + geom_bar(position = "fill")
   p4 = df %>% ggplot() + aes(x = high_use, y = freetime, fill = high_use) + geom_violin()
   grid.arrange(p1,p2,p3,p4)

  • The first violin-plot shows that studytime is affecting the high alcohol usage in such a way that those who study a lot are less likely to be high alcohol users, however there are exceptions. For shorter study hours there seems to be no big difference in the distribution of high and low alcohol users. My hypothesis was not exactly this, but along these lines. It is not necessary that little studying causes high alcohol consumption but a lot of study hours seems to prevent it.
  • Regarding extra-curricular activities, there is a slight increase of the proportion of high alcohol users in the no-hobby group, but this doesn’t look very significant. This is again in line with my hypothesis but the effect is much smaller.
  • Regarding the higher education plans there is a clear correlation: in the group planning further education the proportion of high alcohol users is much smaller than in the other group. This is in line with my hypothesis.
  • Freetime is in some sense complementary to studytime, and the difference is also reflected here. Similarly, a lot of freetime doesn’t necessarily mean high alcohol usage, but very little free time seems to prevent it. As in case 1, this was not my hypothesis exactly, but some sort of inversion of it.

5. Logistic regression

  m = glm(high_use ~ studytime + activities + higher + freetime, data = df, family = "binomial")

  summary(m)
## 
## Call:
## glm(formula = high_use ~ studytime + activities + higher + freetime, 
##     family = "binomial", data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6471  -0.8668  -0.6585   1.1902   2.1362  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -0.0841     0.7071  -0.119 0.905327    
## studytime      -0.5273     0.1577  -3.344 0.000826 ***
## activitiesyes  -0.2283     0.2400  -0.951 0.341541    
## higheryes      -0.7545     0.5398  -1.398 0.162162    
## freetime        0.3340     0.1246   2.680 0.007359 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 423.64  on 365  degrees of freedom
## AIC: 433.64
## 
## Number of Fisher Scoring iterations: 4

The model suggests that the studytime and freetime variables are statistically significant predictors of high/low alcohol usage. It is interesting that the higher-education variable is not significant, but I double-checked the data, and it is severely skewed towards students planning higher education (16 students not planning versus 354 planning). With so few students in the "no" group, this variable carries little information for the prediction. The categorical variables (activities and higher) were internally converted to "dummy" indicator levels; as both have only two levels, only one dummy variable was created for each.

6. Prediction evaluation

coef(m)
##   (Intercept)     studytime activitiesyes     higheryes      freetime 
##   -0.08410415   -0.52727634   -0.22825310   -0.75452928    0.33399121
exp(coef(m))
##   (Intercept)     studytime activitiesyes     higheryes      freetime 
##     0.9193355     0.5902103     0.7959228     0.4702319     1.3965309

The coefficients describe the association between each variable and the outcome; the sign gives the direction, i.e. a negative coefficient means that a higher value (e.g. of studytime) lowers the predicted probability of high alcohol usage. In exponential form the coefficients are odds ratios for a one-unit change, which quantify each variable's influence in the model: values close to 1 mean little effect, while deviations in either direction show that the variable is important for the prediction.
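To make the odds-ratio reading concrete, here is the conversion for studytime using the coefficient from the summary above:

```r
# Log-odds coefficient of studytime from the summary above
b_studytime <- -0.5273
exp(b_studytime)   # about 0.59: one extra unit of studytime multiplies
                   # the odds of high alcohol use by roughly 0.59
```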

confint(m)
## Waiting for profiling to be done...
##                     2.5 %     97.5 %
## (Intercept)   -1.47694251  1.3160333
## studytime     -0.84575068 -0.2261421
## activitiesyes -0.70031246  0.2420972
## higheryes     -1.84750298  0.3028107
## freetime       0.09312641  0.5827670

These are 95% confidence intervals. It is usually understood that if an interval does not contain 0 then the result is significant, and indeed this matches the significance table of the model. The results match my hypotheses, except for the higher-education variable, which is explained by the skew discussed above.

# Use only the significant variables
  m2 = glm(high_use ~ studytime + freetime, data = df, family = "binomial")
  probabilities <- predict(m2, type = "response")
  
  df %<>% mutate(probability = probabilities)
  df %<>% mutate(prediction = if_else(probability>0.5,TRUE,FALSE))
  
  # tabulate the target variable versus the predictions
  tt = table(high_use = df$high_use, prediction = df$prediction)
  
  tt
##         prediction
## high_use FALSE TRUE
##    FALSE   250    9
##    TRUE    102    9

This table is also called the confusion matrix. A usual metric calculated from it is precision (TP/(TP+FP)), which here is 0.5; that is pretty bad. Another common metric is recall (TP/(TP+FN)), which is 0.081, almost 0. Accuracy (correct predictions / total predictions) was a bit better: 0.7. The training error is naturally 1 − accuracy, i.e. 0.3. This model performs quite badly at predicting the high alcohol users, but quite well at identifying the non-high consumers. Simple guessing (without any other knowledge) should give an accuracy around 0.5, hence this model is still a little better than a random guess.
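The metrics quoted above can be recomputed directly from the counts in the confusion matrix:

```r
# Counts from the confusion matrix above
TN <- 250; FP <- 9
FN <- 102; TP <- 9

precision <- TP / (TP + FP)                    # 9/18  = 0.5
recall    <- TP / (TP + FN)                    # 9/111 ≈ 0.081
accuracy  <- (TP + TN) / (TP + TN + FP + FN)   # 259/370 = 0.7

c(precision = precision, recall = recall, accuracy = accuracy)
```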